Web document clustering using hyperlink structures
Authors
Abstract
With the exponential growth of information on the World Wide Web, there is great demand for developing efficient methods for effectively organizing the large amount of retrieved information. Document clustering plays an important role in information retrieval and taxonomy management for the Web. In this paper we examine three clustering methods: K-means, multi-level METIS, and the recently developed normalized-cut method using a new approach of combining textual information, hyperlink structure and co-citation relations into a single similarity metric. We found the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on three datasets of web query results. We also explore some theoretical connections between the normalized-cut method and the K-means method. © 2002 Elsevier Science B.V. All rights reserved.
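The abstract names the three ingredients of the similarity metric (text, hyperlinks, co-citation) but does not give the formula that combines them. The sketch below is therefore only an illustration under stated assumptions: it uses a plain weighted sum with hypothetical weights `w_text`, `w_link` and `w_cocite`, and it performs a single two-way normalized cut through the standard spectral relaxation (the second-largest eigenvector of the degree-normalized similarity matrix, split at zero). The function names and the toy data are invented for this example and are not taken from the paper.

```python
import numpy as np


def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T


def combined_similarity(terms, links, w_text=0.5, w_link=0.3, w_cocite=0.2):
    """Blend text, hyperlink and co-citation similarity into one matrix.

    terms: (n, d) float term-weight matrix, one row per document.
    links: (n, n) 0/1 matrix, links[i, j] = 1 if page i links to page j.
    The weighted sum and its weights are assumptions of this sketch.
    """
    text = cosine_similarity(terms)
    # Treat a hyperlink in either direction as an undirected connection.
    link = ((links + links.T) > 0).astype(float)
    np.fill_diagonal(link, 0.0)
    # cocite[i, j] = number of pages that link to both i and j.
    cocite = (links.T @ links).astype(float)
    np.fill_diagonal(cocite, 0.0)
    if cocite.max() > 0:
        cocite /= cocite.max()
    return w_text * text + w_link * link + w_cocite * cocite


def normalized_cut_bipartition(W):
    """Two-way normalized cut via the spectral relaxation.

    Solves the symmetric eigenproblem for D^{-1/2} W D^{-1/2} and splits
    the documents by the sign of the rescaled second-largest eigenvector.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    M = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    M = (M + M.T) / 2.0  # remove rounding asymmetry before eigh
    _, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    y = d_inv_sqrt * eigvecs[:, -2]
    return y > 0


# Toy data: two topical groups of three pages each. The last term column is
# a weak shared term that keeps the similarity graph connected.
terms = np.array([
    [1.0, 1.0, 0.0, 0.0, 0.1],
    [1.0, 0.5, 0.0, 0.0, 0.1],
    [0.8, 1.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 1.0, 0.1],
    [0.0, 0.0, 1.0, 0.5, 0.1],
    [0.0, 0.0, 0.7, 1.0, 0.1],
])
links = np.zeros((6, 6), dtype=int)
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                links[i, j] = 1  # pages link only within their own group

W = combined_similarity(terms, links)
labels = normalized_cut_bipartition(W)
print(labels)
```

Which side of the split is labeled `True` depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. More than two clusters could be obtained by applying the bipartition recursively, which is one common strategy and not necessarily the one used in the paper.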
Similar resources
Incorporating Hyperlink Analysis in Web Page Clustering
The size of the World Wide Web is growing rapidly and it has become a very important source of information that can be useful to various academic and commercial applications. However, because of the large number of documents online, it is becoming increasingly difficult to search for useful information on the Web. General-purpose Web search engines, such as Google and AltaVista, present search ...
Vision-Based Deep Web Data Extraction for Web Document Clustering
The design of web information extraction systems is becoming more complex and time-consuming. Detecting the data region is a significant problem for information extraction from web pages. In this paper, an approach to vision-based deep web data extraction is proposed for web document clustering. The proposed approach comprises two phases: 1) Vision-based web data extraction, and 2) web documen...
Using Fuzzy Logic Clustering Discover Semantic Similarity in Web Document
The complex and high interactions between terms in documents demonstrates vague and ambiguous meanings. There exist complicated associations within one web document and linking to the others. Most of these approaches perform similarity and feature section methods. There is need of complex document clustering and produced meaningful document. This paper proposed methodology is capable of handles...
Performance Analysis of Vision-based Deep Web Data Extraction for Web Document Clustering
Web data extraction is a critical task that applies various scientific tools across a broad range of application domains. Extracting data from multiple web sites is becoming more obscure, and the design of web information extraction systems is becoming more complex and time-consuming. We also present in this paper various risks in web data extraction. Identifying data region from web is a...
An adaptive neural network approach to hypertext clustering
The WWW is an on-line hypertextual collection, and a more sophisticated algorithm for Web page clustering may have to be based on combined term-similarity and hyperlink-similarity measures. It has been observed that nearly all currently employed techniques for document classification on the Web make use of textual information only. In addition, most of these techniques are incapable of discover...
Journal title:
- Computational Statistics & Data Analysis
Volume 41, issue
Pages -
Publication date: 2002